Assignment

This post is built for Assignment of ISSS608 Visual Analytics and Application.

LI Yuhong
07-05-2021

1. Overview

1.1 Background

This assignment topic is based on Mini-Challenge 2 of VAST Challenge 2021, and there is background introduction of VAST Challenge 2021.

In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.

In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.

1.2 Objective

The data used in this research is the movement and tracking data of employees. GAStech provides many of their employees with company cars for their personal and professional use, but unbeknownst to the employees, the cars are equipped with GPS tracking devices.

Based on the given tracking data for the two weeks leading up to the disappearance, as well as credit card transactions and loyalty card usage data, we aim to use appropriate visually driven data analysis techniques to complete two objectives:

  1. Identify anomalies and suspicious behaviors.

  2. Identify the owners of each credit and loyalty card.

2. Literature Review

Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make it easier to identify patterns, trends and outliers in large data sets. It provides a quick and effective way to communicate information in a universal manner using visual information.

In this research, the two main data visualization methods used to attain research goals and answer questions are graphs and maps. Graphs include stable graphs and interactive graphs and the main R packages used are ggplot and plotly. The ggplot package is the most popular data visualization package in the R community. It was created by Hadley Wickham in 2005. It was implemented based on Leland Wilkinson’s Grammar of Graphics — a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. While using ggplot2, we provide the data, call specific function, map our desired variables to aesthetics, define graphical arguments. The plotly is a R package for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js.

3. Installation and load of R packages

The code chunks below are used to install and load the packages in R.

packages = c("tidyverse","tidyr","tidyverse","dplyr","sp","raster","sf",
             "vctrs","clock","tmap","rgdal","readr","ggplot2","plotly","tmap","gganimate","av","gifski",
             "igraph","tidygraph","ggraph","visNetwork","lubridate","DT","collapsibleTree")
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

4. Data Wrangling and Extraction

4.1 Data Import and View

The data was imported by using read.csv().

carassign <- read_csv("./MC2/car-assignments.csv")

creditcard <- read.csv("./MC2/cc_data.csv")

gps <- read_csv("./MC2/gps.csv")

loyaltycard <- read.csv("./MC2/loyalty_data.csv")

Abila_st <- st_read(dsn = "MC2/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `D:\liyuhong0110\BlogFile\_posts\2021-07-05-assignment\MC2\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84

Change the data type of MC2-tourist.jpg to tif by save as. (Figure 1)

Figure 1

View the data by the following coding.

head(carassign)
# A tibble: 6 x 5
  LastName FirstName CarID CurrentEmploymentType CurrentEmploymentTit~
  <chr>    <chr>     <dbl> <chr>                 <chr>                
1 Calixto  Nils          1 Information Technolo~ IT Helpdesk          
2 Azada    Lars          2 Engineering           Engineer             
3 Balas    Felix         3 Engineering           Engineer             
4 Barranco Ingrid        4 Executive             SVP/CFO              
5 Baza     Isak          5 Information Technolo~ IT Technician        
6 Bergen   Linnea        6 Information Technolo~ IT Group Manager     
head(creditcard)
         timestamp            location price last4ccnum
1 01/06/2014 07:28 Brew've Been Served 11.34       4795
2 01/06/2014 07:34    Hallowed Grounds 52.22       7108
3 01/06/2014 07:35 Brew've Been Served  8.33       6816
4 01/06/2014 07:36    Hallowed Grounds 16.72       9617
5 01/06/2014 07:37 Brew've Been Served  4.24       7384
6 01/06/2014 07:38 Brew've Been Served  4.17       5368
head(gps)
# A tibble: 6 x 4
  Timestamp           id   lat  long
  <chr>            <dbl> <dbl> <dbl>
1 01/06/2014 06:28    35  36.1  24.9
2 01/06/2014 06:28    35  36.1  24.9
3 01/06/2014 06:28    35  36.1  24.9
4 01/06/2014 06:28    35  36.1  24.9
5 01/06/2014 06:28    35  36.1  24.9
6 01/06/2014 06:28    35  36.1  24.9
head(loyaltycard)
   timestamp            location price loyaltynum
1 01/06/2014 Brew've Been Served  4.17      L2247
2 01/06/2014 Brew've Been Served   9.6      L9406
3 01/06/2014    Hallowed Grounds 16.53      L8328
4 01/06/2014        Coffee Shack 11.51      L6417
5 01/06/2014    Hallowed Grounds 12.93      L1107
6 01/06/2014 Brew've Been Served  4.27      L4034

The overview of data are shown in Table 1.

Table 1

S/N Name of Data Number of Observations Variables
1 carassign 44 LastName, FirstName, CarID, CurrentEmploymentType, CurrentEmploymentTitle
2 creditcard 1490 timestamp, location, price, last4ccnum
3 gps 685,169 Timestamp, id, lat, long
4 loyaltycard 1392 timestamp, location, price, loyaltynum

4.2 Data Wrangling and Extraction

4.2.1 Correct Data Type

The columns of “timestamp” should be Date data type instead of character, so we change the data type as the following codes.

# Change data type of columns "timestamp"
creditcard$timestamp <- date_time_parse(creditcard$timestamp,
                                 zone="",
                                 format="%m/%d/%Y %H:%M")

loyaltycard$timestamp <- date_time_parse(loyaltycard$timestamp,
                                 zone="",
                                 format="%m/%d/%Y")

gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone="",
                                 format="%m/%d/%Y %H:%M")

4.2.2 Adjust Incorrect Columns

There are some incorrect columns shown in Figure 2 in the dataset of creditcard and loyaltycard.

Figure 2

For loyaltycard dataset, we correct this problem by the following codes and get the dataset of loyaltycard_final dataset which will be used in the following analysis.

# Build the wrong loyaltycard dataset
wrongloyalty <- loyaltycard %>%
  filter(loyaltynum=="")
wrongloyalty <- wrongloyalty[,-4]
names(wrongloyalty)[3] <-"loyaltynum"
wrongloyalty <- wrongloyalty %>%
  separate(location,into = c("location","price"),sep ="\\?",convert = TRUE)
head(wrongloyalty)
   timestamp       location price loyaltynum
1 2014-01-06 Katerina’s Caf 33.54      L6110
2 2014-01-06 Katerina’s Caf 38.65      L6110
3 2014-01-06 Katerina’s Caf 26.46      L8328
4 2014-01-06 Katerina’s Caf 28.00      L6267
5 2014-01-06 Katerina’s Caf 26.51      L9406
6 2014-01-07 Katerina’s Caf 37.44      L2343
# Remove Null rows
loyaltycard <- loyaltycard %>%
  filter(loyaltynum!="")

# Change the data type of "price" column
loyaltycard$price <- as.numeric(loyaltycard$price)
class(loyaltycard$price)
[1] "numeric"
# Combine wrongloyaltycard and loyaltycard dataset to build final loyalty card dataset
loyaltycard_final <- rbind(loyaltycard,wrongloyalty)

After completing the same actions in the creditcard dataset as following, we got the dataset of creditcard_final dadtaset.

# Build the wrong creditcard dataset
wrongcredit <- creditcard %>%
  filter(is.na(last4ccnum))
wrongcredit <- wrongcredit[,-4]
names(wrongcredit)[3] <-"last4ccnum"
wrongcredit <- wrongcredit %>%
  separate(location,into = c("location","price"),sep ="\\?",convert = TRUE)
head(wrongcredit) 
            timestamp       location price last4ccnum
1 2014-01-06 12:56:00 Katerina’s Caf 13.68       8332
2 2014-01-06 13:28:00 Katerina’s Caf 33.54       8411
3 2014-01-06 13:46:00 Katerina’s Caf 18.56       6899
4 2014-01-06 13:50:00 Katerina’s Caf 13.59       1874
5 2014-01-06 13:50:00 Katerina’s Caf 32.64       9617
6 2014-01-06 13:53:00 Katerina’s Caf 36.95       2142
# Remove Null rows
creditcard <- creditcard %>%
  filter(last4ccnum!="NA")

# Combine wrongcredit and creditcard dataset to build final credit card dataset
creditcard_final <- rbind(creditcard,wrongcredit)

FirstName and LastName columns can be combined to build a new column in the carassign dataset, and then a new dataset called carassign_final was built.

# Combine FirstName and LastName columns
carassign_final <- carassign %>% 
  unite(Name,c("FirstName","LastName"), sep = " ")

4.2.3 Check Missing Values

Then we check the missing values and view the wrangled dataset.

# Identify missing values
which(rowSums(is.na(loyaltycard_final))==TRUE)
integer(0)
which(rowSums(is.na(creditcard_final))==TRUE)
integer(0)
which(rowSums(is.na(carassign_final))==TRUE)
[1] 36 37 38 39 40 41 42 43 44
which(rowSums(is.na(gps))==TRUE)
integer(0)
# Check the dataset
head(loyaltycard_final)
   timestamp            location price loyaltynum
1 2014-01-06 Brew've Been Served  4.17      L2247
2 2014-01-06 Brew've Been Served  9.60      L9406
3 2014-01-06    Hallowed Grounds 16.53      L8328
4 2014-01-06        Coffee Shack 11.51      L6417
5 2014-01-06    Hallowed Grounds 12.93      L1107
6 2014-01-06 Brew've Been Served  4.27      L4034
head(creditcard_final)
            timestamp            location price last4ccnum
1 2014-01-06 07:28:00 Brew've Been Served 11.34       4795
2 2014-01-06 07:34:00    Hallowed Grounds 52.22       7108
3 2014-01-06 07:35:00 Brew've Been Served  8.33       6816
4 2014-01-06 07:36:00    Hallowed Grounds 16.72       9617
5 2014-01-06 07:37:00 Brew've Been Served  4.24       7384
6 2014-01-06 07:38:00 Brew've Been Served  4.17       5368
head(carassign_final)
# A tibble: 6 x 4
  Name            CarID CurrentEmploymentType  CurrentEmploymentTitle
  <chr>           <dbl> <chr>                  <chr>                 
1 Nils Calixto        1 Information Technology IT Helpdesk           
2 Lars Azada          2 Engineering            Engineer              
3 Felix Balas         3 Engineering            Engineer              
4 Ingrid Barranco     4 Executive              SVP/CFO               
5 Isak Baza           5 Information Technology IT Technician         
6 Linnea Bergen       6 Information Technology IT Group Manager      
head(gps)
# A tibble: 6 x 4
  Timestamp              id   lat  long
  <dttm>              <dbl> <dbl> <dbl>
1 2014-01-06 06:28:00    35  36.1  24.9
2 2014-01-06 06:28:00    35  36.1  24.9
3 2014-01-06 06:28:00    35  36.1  24.9
4 2014-01-06 06:28:00    35  36.1  24.9
5 2014-01-06 06:28:00    35  36.1  24.9
6 2014-01-06 06:28:00    35  36.1  24.9

We can see that there are missing values (Figure 3) in rows from 36 to 44 in carassign_final dataset. It means that cardID is lack for truck drivers.

Figure 3

4.2.4 Data Extraction

After observing we find that only car ID from 1 to 35 will be used, so car ID included other number Figure 4 will be excluded from the analysis of gps dataset.

Figure 4

5. The Most Popular Locations

There are two dataset can be used to identify the popular locations, and after obversing the relationship of two dataset we find that there is some purchase was recorded in both dataset due to the same purchase time and purchase price. Hence, we will be based on loyaltycard_final and creditcard_final respectively.

5.1.1 Based on Loyalty Card Recording

To identify the the popular locations based on loyalty card dataset, we use the following steps.

# Calculate the number of locations appeared based on loyalty card
location_num_loyaltycard <- loyaltycard_final %>% 
  group_by(location) %>% 
  summarise(num_loyalty=n()) %>% 
  arrange(desc(num_loyalty))

# Plot the number of locations appeared based on loyalty card
p1 <- plot_ly(data = location_num_loyaltycard,
              x = ~location,y = ~num_loyalty,
              type="bar",
              text = ~paste("Location:", location,
                      "<br>Frequency:", num_loyalty)) %>% 
  layout(yaxis = list(title = 'Frequency'),
         xaxis = list(title = 'Location',
                      categoryorder = "array",
                      categoryarray = ~num_loyalty),
         title = 'The Frequency of Each Location Based on Loyalty Card')
p1

The graph shows that the five most popular locations is: Katerina’s Caf, Hippokampos, Guy’s Gyros, Brew’ve Been Served, and Hallowed Grounds (Figure 5).

Figure 5

5.1.2 Based on Credit Card Recording

To identify the the popular locations based on credit card dataset, we use the following steps.

# Calculate the number of locations appeared based on credit card
location_num_creidtcard <- creditcard_final %>% 
  group_by(location) %>% 
  summarise(num_credit=n()) %>% 
  arrange(desc(num_credit))

# Plot the number of locations appeared based on credit card
p2 <- plot_ly(data = location_num_creidtcard,
              x = ~location,y = ~num_credit,
              type="bar",
              text = ~paste("Location:", location,
                      "<br>Frequency:", num_credit)) %>% 
  layout(yaxis = list(title = 'Frequency of happened purchase'),
         xaxis = list(title = 'Location',
                      categoryorder = "array",
                      categoryarray = ~num_credit),
         title = 'The Frequency of Each Location Based on Credit Card')
p2

The graph shows that the five most popular locations is: Katerina’s Caf, Hippokampos, Guy’s Gyros, Brew’ve Been Served, and Hallowed Grounds (Figure 6), which is the same as the result based on loyalty card dataset.

Figure 6

The hour and minute was recorded in Timestamp columns in creditcard dataset, but not in loyalty dataset. Hence, we will figure out the popular days based on loyalty dataset, and find the pupular moment based on creditcard dataset.

5.2.1 Based on Loyalty Card Recording

Write the following codes to draw the graph for the five most popular locations.

# Build new dataset for the five most popular locations
loyalty_day <- loyaltycard_final %>% 
  filter(location==c("Katerina’s Caf", "Hippokampos", "Guy's Gyros", 
                     "Brew've Been Served", "Hallowed Grounds")) 

# Build a new column for day and change the data type
loyalty_day$day <- as.numeric(get_day(loyalty_day$timestamp))

# Wrangle the dataset for the frequency of purchase for every day
loyalty_day_num <- loyalty_day %>% 
  group_by(day) %>% 
  summarise(frequency=n())

# Draw the graph
p3 <- plot_ly(data = loyalty_day_num,
              x = ~day, 
              y = ~frequency, 
              type = 'scatter', 
              mode = 'lines',
              text = ~paste("Day:", day,
                      "<br>Frequency:", frequency),
              name="Trend") %>% 
  layout(yaxis = list(title = 'Frequency'),
         xaxis = list(title = 'Day'),
         title = "Frequency of Loyalty Card Purchase for Five Most Pupular Locations") %>% 
  add_trace(mode = 'markers',
            name="Detail")

p3

The line graph shows that 2014/1/6, 2014/1/16 and 2014/1/17 are the most popular days for the five most popular locations due to the peak that appeared in these dates.

5.2.2 Based on Credit Card Recording

Write the following codes to draw the 2D Histograms graph for the 20 most popular locations.

# Build new dataset for the five most popular locations
credit_hour <- creditcard_final %>% 
  filter(location==c("Katerina’s Caf", "Hippokampos", "Guy's Gyros", 
                     "Brew've Been Served", "Hallowed Grounds")) 

# Build a new column for day and change the data type
credit_hour$hour <- as.numeric(get_hour(credit_hour$timestamp))

# Wrangle the dataset for the frequency of purchase for every day
credit_hour_num <- credit_hour %>% 
  group_by(location,hour) %>% 
  summarise(frequency=n())

# Draw the graph
p4 <- ggplot(data=credit_hour_num,
             aes(x=location,y=hour,fill=frequency))+
  geom_tile()+
  scale_y_continuous(limits = c(0,24),
                     breaks = c(0,2,4,6,8,10,12,14,16,18,20,22,24))+
  scale_fill_gradient2(low="blue",
                       high="red")+
  labs(x="Location",
       y="Hour",
       title="Frequency of Credit Card Purchase for Five Most Popular Locations")+
  theme(plot.title=element_text(hjust=0.5))
p4

Based on the above graph, we can find that 7:00-8:00 is popular for Brew’ve Been Served and Hallowed Grounds, and 13:00-14:00 and 19:00-20:00 is popular for Katerina’s Caf, Hippokampos, Guy’s Gyros.

5.2.3 Anomalies

  1. Popular and Important Coffee Shops A large part of the most popular locations for GAStech employees are coffee shops. For example, three locations, Katerina’s Caf, Brew’ve Been Served, and Hallowed Grounds in the five most popular locations are coffee shops.

Based on the frequency of purchase, we knew that there are a cycle of low frequency of purchase for the five most popular locations, and the time interval was 5 days. We can speculate that 2014/1/11, 2014/1/12, 2014/1/18 and 2014/1/19 are weekends. And after checking the calender, we can confirm our guess. Hence, we can conclude that these locations were the important locations for GAStech employees to break, have meals and work in weekdays, and employees went to these location less in weekends. The coffee shops provided suitable and comfortable environment for the GAStech employees who leaf companies.

Figure 7

  1. Inconvenience for Having Meals According to the popular locations, we can know that restaurants were also the locations that the employees often drove to during the time for lunch and dinner. It means that employees could not have meals within GAStech or nearby GAStech, which caused inconvenience and time waste for employees.

Figure 8

  1. Overtime Work Drinking coffee as breakfast or lunch is popular for GAStech employees, but the uncommon time of drinking coffee is likely to lead to overtime work. According to the frequency of purchase, Guy’s Gyros, Hippokampos, Katerina’s Caf were popular in 19:00-20:00, but coffee is not a wise choice to drink in the evening due to the probable insomnia problems. However, someone will drink coffee to keep the spirit in the evening in order to deal with their work. We can speculate overtime work is common for GAStech employees.

Figure 9

5.2.4 Recommendation

  1. GAStech should improve efficiency and arrange daytime reasonably to reduce overtime work and keep employees healthy.

  2. GAStech should provide more comfortable restaurants for employees within the area of company, which will be more convenient for employees to have their meals nearby instead of driving to have meals.

  3. For some necessary overtime work, the company should provide comfortable working space in the evening.

6. Owner Inference

6.1 Plot GPS Path and Stop-by Points for Each Car

Using gps dataset, we can get maps of GPS paths and stop-by points for each car.

Firstly, plot the map of Kronos and build the dataset of GPS path.

# Import MC2-tourist.tif
bgmap <- raster("MC2/MC2-tourist.tif")

# Tidy up dataset
gps_final1 <- gps

gps_final1$id <- as_factor(gps_final1$id)
gps_final1$day <- as.factor(get_day(gps_final1$Timestamp))

gps_sf <- st_as_sf(gps_final1,                   
                   coords=c("long","lat"),
                   crs=4326)

# Build the dataset of GPS path
gps_path <- gps_sf %>%
  group_by(id,day) %>%
  summarize(m=mean(Timestamp),
             do_union=FALSE) %>%
  st_cast("LINESTRING")

And then we should figure out the stop-by points. In this part, the stop-by points mean that the person stops the cars for a period larger than 300 seconds (5 minutes). Start time means the start time of stopping cars and end time means the end time of stopping cars (restart cars and move forward). Time different (tdiff) is calculated by End time-Start time, and the stops of cars with a period larger than 300 seconds will be selected.

# Calculate and select rows with the time difference larger than 300 seconds
gps_stop <- gps

gps_stop <- gps_stop %>% 
  arrange(id,Timestamp)

gps_stop$tdiff <- unlist(tapply(gps_stop$Timestamp, INDEX = gps_stop$id,
                          FUN = function(x) c(0, `units<-`(diff(x),"secs"))))

gps_stop <- gps_stop %>% 
  filter(tdiff>300) %>% 
  rename(endtime=Timestamp) %>% 
  mutate(starttime=endtime-tdiff)

gps_stop_sf <- st_as_sf(gps_stop,                   
                   coords=c("long","lat"),
                   crs=4326)

Plot the GPS path and stop-by points for each car, and we use the graph of the car whose ID is 1 as the example.

# Set the tmap mode to "view"
tmap_mode("view")

# Select the dataset for ID 1 car 
gps_stop_sf_selected <- gps_stop_sf %>%
  filter(id==2)
gps_path_selected <- gps_path %>%
  filter(id==2)

# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_stop_sf_selected) +
  tm_dots(size = 0.075)+
  tm_shape(gps_path_selected) +
  tm_lines(col = "day",
           lwd = 3)

Based on the GPS path and stop-by points for each car, we can reassess the anomalies in 5.2.3. Coffee shops were indeed the popular locations due to frequent appears of stop-by points in coffee shops and someone would go to coffee shops in the evening after leaving the company (may work in coffee shops), as showing the following graph for car ID 6 in the date of 6.

# Select the dataset for ID 6 car in the date of 6
gps_stop_sf_selected_work <- gps_stop_sf %>%
  filter(id==6,get_day(endtime)==6)
gps_path_selected_work <- gps_path %>%
  filter(id==6,day==6)

# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_stop_sf_selected_work) +
  tm_dots(size = 0.075)+
  tm_shape(gps_path_selected_work) +
  tm_lines(col = "day",
           lwd = 3)

After ID 6 car leaf GAStech at 17:30 at the date of 2012/1/6, the user of this car arrived Guy’s Gyros (coffee shops) at 18:45 and arrived Katerina’s Caf at 20:14. The user may work in these two coffee shops.

Figure 10

6.2 Methods of Matching Owners for Cards

After observing, we knew that there are some purchase records show the same purchase price, purchase locations and purchase date in the dataset for both credit card and loyalty card, so we can speculate that the purchse records were done at the same time by the same person. Hence, we can match credit cards and loyalty cards firstly and then figure out the owners of them.

# Match cards by the same purchase price, purchase locations and purchase date
creditcard_final_day <- creditcard_final
creditcard_final_day$timestamp <- format(creditcard_final_day$timestamp,format="%Y-%m-%d")
loyaltycard_final_day <- loyaltycard_final
loyaltycard_final_day$timestamp <- as.character(loyaltycard_final_day$timestamp)
loyal_creidt <- inner_join(creditcard_final_day,loyaltycard_final_day,
                          by=c("price","location","timestamp")) %>% 
  select(last4ccnum,loyaltynum)

# Build dataset for the result of matching
loyal_creidt_unique <- unique(loyal_creidt) 
loyal_creidt_unique <- loyal_creidt_unique%>% 
  group_by(last4ccnum) %>%
  summarise(loyaltynum = paste(loyaltynum, collapse = ", "))

We build the network graph to show the relationship between cards and locations according to the purchase records of credit cards.

# Transforming data
creditcard_final_net <- creditcard_final

creditcard_final_net$last4ccnum <- as.character(creditcard_final_net$last4ccnum)
creditcard_final_net$Day  <-  get_day(creditcard_final_net$timestamp)
creditcard_final_net$Hour <-  get_hour(creditcard_final_net$timestamp)

# Creating nodes list
sources <- creditcard_final_net %>%
  distinct(last4ccnum) %>%
  rename(label = last4ccnum)
destinations <- creditcard_final_net %>%
  distinct(location) %>%
  rename(label = location)
cc_nodes <- full_join(sources, 
                   destinations, 
                   by = "label")
cc_nodes <- cc_nodes %>% 
  rowid_to_column("id")

# Creating and tidying edges list
edges <- creditcard_final_net %>%  
  group_by(last4ccnum, location, Day, Hour) %>%
  summarise(weight = n()) %>% 
  ungroup()

cc_edges <- edges %>% 
  left_join(cc_nodes, 
            by = c("last4ccnum" = "label")) %>% 
  rename(from = id)

cc_edges <- cc_edges %>% 
  left_join(cc_nodes, 
            by = c("location" = "label")) %>% 
  rename(to = id)

cc_edges <- select(cc_edges, from, to, 
                   Day, Hour, weight)

cc_graph <- tbl_graph(nodes = cc_nodes, 
                      edges = cc_edges, 
                      directed = FALSE)

# Plot the interactive network
visNetwork(cc_nodes,
           cc_edges) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

By choosing the credit card ID we can know the locations that this card was used in the given period of two weeks. To know more about the details, like purchase frequency in each location and corresponding loyalty cards, we built a data table that can help us to finding the card owners more accurately.

# Prepare dataset
creditcard_final_loyalty_DT <- creditcard_final %>% 
  group_by(last4ccnum,location) %>% 
  summarise(purchasenum=n())
creditcard_final_loyalty_DT <- left_join(creditcard_final_loyalty_DT,
                                           loyal_creidt_unique,
                                           by="last4ccnum")
creditcard_final_loyalty_DT <- creditcard_final_loyalty_DT %>% 
  arrange(desc(purchasenum))

# Build DT
DT::datatable(creditcard_final_loyalty_DT)

To find the owner of cards we will take the following steps (Use the credit card whose last4ccnum is 4795 as the example):

  1. Choose the ID of 4795 in the network graph and then the locations that this card was used in were shown.

Figure 11

  1. After comparing the purchase locations with GPS path and stop-by points graphs of all the cars, we can find the Car ID of probable owners because the stop-by points were probably the same as or similar as the purchase locations.

  2. The following GPS path and stop-by points graphs was selected as the best candidate. The car ID of this graph is 17, and the name of car user is Flecha Sven.

Figure 12

  1. In order to ensure the accuracy of judging, we should choose some locations to compare time of stop-by and time of purchase records. If the most time is similar and reasonable, we could be more confident about the result of matching. Figure 13

Figure 14

  1. We search 4795 in the data table and then we got the corresponding loyalty card number of L8566 and L2070.

Figure 15

  1. We can conclude that Flecha Sven may be the owner of credit card 4795 and loyalty card L8566 and L2070.

6.3 Uncertainties of Methods

  1. Because we needed to compare the purchase locations with GPS path and stop-by points graphs of all the cars, there were potential possibility of judging wrongly and mismatching the owner with the cards.

  2. Long time was needed to judge and match each card and each person.

6.4 Uncertainties in Dataset

  1. The lack of geographical coordinates for each locations caused difficulties for us to confirm each location, and we could only find the locations in the map. And if there were geographical coordinates for each locations, we could plot the purchase records in the map directly and then compared with the GPS Path to match owners and cards. It will be more accurate and efficient to complete the task of matching.

  2. The purchase records of some credit cards were strange. For these credit cards, purchase records of one credit card were recorded in different loyalty cards. It seems that the owner of this credit card had two of more loyalty cards or the owner would borrow other persons’ loyalty cards.

  3. The timestamp column of loyalty card dataset is different from the timestamp column of credit card dataset, and it only shows the date but not the time in detail. Hence, we can not match the loyalty card number and credit card number accurately according to the same timestamp value.

7. Relationship of Employees

To find the relationship of employees, we should plot the network through tree map.

# Now we can plot the tree maps
p6 <- collapsibleTree(carassign_final,
                      hierarchy = c("CurrentEmploymentType", "CurrentEmploymentTitle","Name"),
                      root = "GASTech",
                      width = 800,
                      fill = c("seashell",
    rep("brown", length(unique(carassign_final$CurrentEmploymentType))),
    rep("khaki", length(unique(paste(carassign_final$CurrentEmploymentType,
                                     carassign_final$CurrentEmploymentTitle)))),
    rep("forestgreen", length(unique(paste(carassign_final$Name,
                                           carassign_final$CurrentEmploymentType))))))
p6

Based on the above graph, the official relationship within employees in GASTech can be got. For example, there were five employees (one is IT helpdesk, three are IT Technician and the remaining one is IT Group Manager) whose work type was relative to Information Technology. Hence, they were likely to be colleagues, and IT Group Manager, Linnea Bergen, was the leader of other four employees.

Figure 16

8. Suspicious Activities and Locations

In order to compare the path of each employee for every day, we divided the GPS path by every day as the following. And then we plotted the maps for all employees and then tried to find suspicious activities and locations. Make car of ID 19 as example.

# Try to plot the needed graphs
# Set the tmap mode to "view"
tmap_mode("view")

# Select the dataset for ID 1 car 
gps_stop_sf_selected <- gps_stop_sf %>%
  filter(id==19)
gps_stop_sf_selected_day <- gps_stop_sf_selected %>% 
  mutate(day=get_day(endtime))
gps_stop_sf_selected_day$day <- as.factor(gps_stop_sf_selected_day$day)
gps_path_selected <- gps_path %>%
  filter(id==19)

# Plot the map, GPS path and stop-by points
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_stop_sf_selected_day) +
  tm_dots(size = 0.5,
          col = "day")+
  tm_shape(gps_path_selected) +
  tm_lines(col = "day",
           lwd = 7)+
  tm_facets(by = "day",
            ncol = 2,
            free.scales = FALSE)+
  tm_layout(legend.title.size=1,
            legend.text.size = 0.6)

8.1 Desafio Golf Course

Desafio Golf Course seemed to be a popular locations for GASTech employees to relax in weekends, but too many employees went here at the same time in the weekend just before the event of disappear, like the owners of car with ID 4, 28, 31, 32, 35.

Figure 17. The GPS path of ID 28 on 2014/1/19

According to the GPS records of the car of ID 31, we could know that this owner drove this car seldom. However, the owner drove out from 2014/1/17 to 2014/1/19, and went to the Desafio Golf Course on 2014/1/19 which was the date just before celebration. It was strange for a person who drove less to drive for continuous three days suddenly.

Figure 18. The GPS path of ID 31 on 2014/1/19

8.2 A White House

There was a white house that some employees would start the path from here and return here after finishing whole day’s schedual (ID 2, 3, 9, 10, 11, 19 and so on). We could make a conjecture that this white house is a residential zone for these persons. It may not be a suspicious location that related to the disappear event, but it is helpful for us to confirm the function of living for this white house when analyzing the behaviors of employees.

Figure 19

Figure 20. The GPS path of ID 10 on 2014/1/7

8.3 Alberts Fine Clothing

GPS path map showed that many employeew went to Alberts Fine Clothing often, especially in the weekend days of 11, 12, 18 and 19. Though they probably made an appointment to go shopping together, but it was still very abnormal that several employees went to together. At the same time, Alberts Fine Clothing was an extremely popular locations among all the locations because almost every employees have the records of stopping and shopping here, though this clothing shop was far from GASTech. Although they might be buying the clothes for attending the celebration of their company, the extremely high frequency was suspicious.

Figure 21

Especially, the owner of the car with the ID of 7 and 32 went to Alberts Fine Clothing almost every weekday.

Figure 22. The GPS path of ID 7 on 2014/1/13

In the weekend, many employees, like the owner of ID 2 car, would leave their residential zones and make a special trip to this clothing store.

Figure 23. The GPS path of ID 2 on 2014/1/19

8.4 Roberts and Son’s

It was also a very popular locations for employees, but we didn’t make accurate justification about function or type of this shop from its name. Many

Figure 24

Some persons would drive for a bit long time to this location at noon after leave company, like owners of cars of ID 11 and 19.

Figure 25. The GPS path of ID 11 on 2014/1/13
Figure 26. The GPS path of ID 19 on 2014/1/15

8.5 Some Suspicious Activities

  1. Strange GPS Paths The GPA path of Cazar Gustav, whose carID is 9, was extremely chaos, and sometimes the path would cross some areas there were not roads. This GPS path was illogical and abnormal, and the probably reason was that GPS device was wrong or the user of this car liked to try some unusual shortcuts.

Figure 27

  1. Quivering Driving Paths The GPA path of Borrasca Isande, whose carID is 28, was quivering when running in the roads. We can suspect that the GPS device was also wrong or the user of this car was a crazy driver in the road.

Figure 28

9. Conclusion

In this research, we use the tracking data of cars, job information of each employees and purchase records of credit card and loyalty card to complete two main objectives of identifying suspicious behaviors and inferring the owners of each credit card and loyalty card by suitable techniques of visualization. After data wrangling and data extraction, the the most popular five locations and the popular periods were figured out, and then the anomalies, like overtime work and inconvenient meals, were found. Secondly, after adding GPS track data the method of matching owners with credit card and loyalty card was built, and we used one credit card as exmaple to show the steps in detail. At the same time, the relationship among all employees was shown by hierarchical graph based on the job information. Last but not least, according to the GPS path and stop-by points of each person, some suspicious activities and locations were figured out to help to retrieve the truth of disappearance.